Main goals of the session

  1. Understand the assumptions and requirements of the statistical tests of neutrality based on polymorphism and divergence data
  2. Calculate and interpreting the results of the HKA-test
  3. Calculate and interpreting the results of the MK-test

1. Practical organization

The main goal of this practical class is to reproduce a few examples of the application of the HKA (Hudson, Kreitman and Aguadé 1987) and MK (McDonald and Kreitman 1991) methods to test for selective neutrality in the recent past of different genomic regions. Both tests are based on important predictions of the Neutral Theory of Molecular Evolution (Kimura 1983). The HKA test contrasts the levels of nucleotide polymorphism (variability within population) with the levels of nucleotide divergence (fixed substitutions between species) for different loci (at least one of these loci is expected to be evolving under neutrality). All neutral loci across the genome must have the same ratio of polymorphism to divergence. This is because both estimates of nucleotide variation are proportional to the mutation rate. The MK test is based on the same assumption but it compares nucleotide variation at different classes of sites within the same gene (i.e. synonymous -or silent- versus non-synonymous sites) rather than across genes. If all amino acid substitutions between two species are neutral, we expect the same number of non-synonymous and synonymous (or silent) changes, the latter reflecting the neutral mutation rate. In this practice, you will work with polymorphism data from two different Drosophila species, Drosophila melanogaster and Drosophila subobscura, and divergence data from other two drosophilid species, Drosophila simulans and Drosophila guanche (see Fig. 1).


Figure 1. Drosophila species used in this practical lesson and their phylogenetic relationships.


Throughout the document you will see different icons whose meaning is:

: Additional or useful information

: Practical exercise

: Hint to solve an exercise or to do a task

: Slot to answer a question

: Problem or task to be solved


2. Installing R packages

You can use either the R console in the terminal or RStudio for this practice. If you don’t have R installed, you can download the appropriate package for your system here. To install RStudio, go to this page and follow the instructions.

Before starting the exercises, you will need to install some necessary libraries for manipulating and analyzing DNA sequence data. Open the R console in the terminal (or in RStudio) and type:

# install.packages("ape")
#devtools::install_github("pievos101/PopGenome")

#install.packages("remotes")
#remotes::install_github("andrewparkermorgan/sfsr")

3. Data files

  1. Data files: link and description (click on the file name to access the file) :

    • rpL32.fasta: This file contains the sequence of the rpL32 gene for 18 individuals (one sequence per individual) of a natural population of D. subobscura (J) and one sequence from a closely related species, D. guanche (Dgua). All sequences of D. subobscura are from the same chromosomal arrangement (O3+4). For this practical, use the 18 sequences of D. subobscura as the polymorphism data and the sequence of D. guanche for divergence estimations.

    • Acph.fasta. This file contains the sequence of the Acph gene for 42 individuals (one sequence per individual) of a natural population of D. subobscura (TB) and one sequence of a closely related species, D. guanche (Dgua). The sequences of D. subobscura are from two different chromosomal arrangements (O3+4+8 and O3+4+23). For this practical, use the 18 sequences of D. subobscura as the polymorphism data and the sequence of D. guanche for divergence estimations.

    • OS.fasta. This file contains the sequence of the OS locus of 14 individuals (one sequence per individual) of an European population of D. melanogaster (M) and 2 individuals of D. simulans (S). For this practical, use the 15 sequences of D. melanogaster as the polymorphism data and the sequence of D. simulans for divergence estimations.

    • E9.fasta. This file contains the sequence of the E9 locus of 15 individuals (one sequence per individual) of world wide sample of D. melanogaster (MEL_) and 1 individual (one sequence per individual) of D. simulans (SIM_). For this practical, use the 15 sequences of D. melanogaster as the polymorphism data and the sequence of D. simulans for divergence estimations.

    • E10.fasta. This file contains the sequence of the E10 locus for 15 individuals (one sequence per individual) of an African population of D. melanogaster (mel-) and one sequence from a closely related species, D. simulans (Dsim). For this practical, use the 15 sequences of D. melanogaster as the polymorphism data and the sequence of D. simulans for divergence estimations.


3. Example 1. The HKA test

Under neutrality, the amount of polymorphism in a species should be correlated with the levels of divergence between species for all loci across the genome due to the dependence of both on the neutral mutation rate (Kimura 1983). The HKA test evaluates the fit of the observed polymorphism and divergence data to this specific prediction (Fig. 2).


Figure 2. The HKA test.


Open the R console in the terminal (or RSsudio), load the library PopGenome, create a new folder for the gene, copy the fasta file to this folder, and read the folder with the function readData. This function creates an object of class GENOME. Then, indicate which samples are from the population of D. subobscura and of D.guanche using the function set.populatons.

library(PopGenome)
  
## create a new working directory for this gene region and copy the fasta file to this folder
dir.create("rpL32")
file.copy(from = "rpL32.fasta",
      to="rpL32/rpL32.fasta")

## read the fasta file
rp <- readData("rpL32")
  
## summary of the data
get.sum.data(rp)
  
## set populations
get.individuals(rp)
rp <- set.populations(rp,list(
        c(get.individuals(rp)[[1]][1:18]), 
        c(get.individuals(rp)[[1]][19]))
        )
rp@region.data@populations
rp@region.data@populations2
## [[1]]
## [[1]][[1]]
##  [1] "J15" "J16" "J17" "J18" "J19" "J20" "J21" "J22" "J23" "J24" "J25" "J26"
## [13] "J27" "J28" "J29" "J30" "J31" "J32"
## 
## [[1]][[2]]
## [1] "Dgua"
rp<-set.outgroup(rp,c("Dgua"))
## |            :            |            :            | 100 %
## |====================================================

To know which indices to use in the function set.populations(), we use the function get.individuals(). Check the result of the commands region.data@populations and region.data@populations2 to make sure you have set the populations correctly.

First, let’s have a look at the variability in the region we are studying. The sliding windonw analysis is very useful for this task. To examine the distribution of the nucleotide polymorphism across the rp32L region, use the sliding.window.transform() function to create a new object (rpsw) where regions correspond to a set of windows into which you divide your region (in the rp object, regions correspond to the entire gene region). You will set a window size of 200 bp and a step size of 10 bp. For example, you can estimate and plot the number of segregating sites across the gene region:

## divide the region in windows
rpsw <- sliding.window.transform(rp,width=200, jump=10,type=2,whole.data=TRUE)

## calculate segregating sites (and other statistics) 
rpsw <- neutrality.stats(rpsw)

## calculate fixed diferences (and other statistics)
rpsw<-calc.fixed.shared(rpsw)

## get segregating sites in D. subobscura
s<-as.data.frame(rpsw@n.segregating.sites[,1])

## get fixed differences between D. subobscura and D. melanogaster
d <- as.data.frame(rpsw@n.fixed.sites)
## plot the results of the sliding windows analysis
PopGplot(s,span=0.1,xlab="position (rp32L gene region)", ylim=c(min(s,na.rm=TRUE),max(s,na.rm=TRUE)))
lines(d[, 1])

legend("topright", legend = c("polymorphism", "divergence"), 
col = c("red", "black"), lty = 1)

Questions

1. Is the observed variation (visually) in agreement with expectations under the neutral theory of molecular evolution? Why?

Answer:

The synchronization of the graphs represents the correlation between polymorphism and divergence within and between species, which suggests that the neutral mutation rate is the primary driver.

2. What would the plot look like if positive selection (=selective sweep) had recently acted on the 3’ half of the gene? and if balancing selection had acted on the 5’ half of the gene??

Answer:

If positive selection had recently acted on the 3' half of the gene we would observe a heavy reduction in this region beacuse of a possible fixation. If balancing selection acted on the 5' half of the gene there would be an increase in variations, as  it means that there is no fixation

To make sure you are on the right track, you can apply the HKA test to both halves of the rpL32 gene region. The hka() function from the sfsr package can be used to do this. However, this function only accepts the site frequency spectrum (SFS) as input for this calculation. Therefore, you must first obtain the SFS of the regions you wish to compare in the HKA test. Just to give you an example, in code bellow you can see how to obtain the SFS and perform the HKA test on the entire rpL32 gene:

library(sfsr)

## extract the sample size
n<-length(rp@populations[[1]])

## get the minor allele frequencies
rp<-detail.stats(rp)
ma<-as.data.frame(rp@region.stats@minor.allele.freqs[1]) ## IMPORTANT: see info section bellow

ma<-ma[1,]

## obtain the unfolded site frequency spectrum in population 1 (including fixed differences with respect to poopulation 2)
ma<-ma*n
sfs<-c()
for (i in 1:n) {
    a<-sum(ma==i)
    sfs<-c(sfs, a)
}
sfs
##  [1] 24 11  2  1  0  1  1  2  2  0  0  0  0  1  2  0  3 43
monomorphic<-rp@n.valid.sites - sum(sfs)
rp.sfs<-c(monomorphic)
rp.sfs<-c(rp.sfs, sfs)
    
    
## plot the SFS
cols <- c("blue", "red")[(sfs >= sfs[n]) + 1 ]
barplot(sfs, main="Site Frequency Spectrum", names.arg=c(1:n), col=cols)
legend("topleft", legend = c("polymorphic", "fixed"), fill=c("blue","red"))

## run the HKA test:
rp.sfs2<-rp.sfs ## IMPORTANT: see info section bellow
hka_rp<-hka_test(rp.sfs, rp.sfs2)
## p-value of the HKA test
hka_rp$p.value
## [1] 1
## observed values
hka_rp$observed
##      sa div
## [1,] 50  54
## [2,] 50  54
## expected values
hka_rp$expected
##      esa ed
## [1,]  50 54
## [2,]  50 54
## residuals
hka_rp$residuals
##      sa div
## [1,]  0   0
## [2,]  0   0

IMPORTANT: note that in the example above we have compared one SFS against itself, which doesn’t make any sense!!!. In real cases, you must to compare the SFS from two different regions, one of which should be evolving under neutrality. Also note that when you have more than one fasta file in the input folder, the object contains as many regions as fasta files in the folder. Use get.sum.data() to know the order of each region and use the correct index when extracting minor.allele.freqs.

Exercise 1

  • Using the functions and commands you already know (from this and the previous practical), create two fasta files, one with the sequences of the 5’ half (e.g. “rpL32_5.fasta”) and the other with the sequences of the 3’ half (e.g. “rpL32L_3.fasta”) of the rpL32 gene region.
    • Since the alignment is 1798 sites long, divide the region into two parts of exactly equal length (1-899 and 900-1798).
  • Apply the HKA test to the two halves of the rpL32 gene and answer the following questions:
library(ape)
#read the file as a sequence, divide it in 2
rp_seq <- read.dna("rpL32.fasta", format = "fasta")
rp_3 <- rp_seq[,c(1:899)]
rp_5 <- rp_seq[,c(900:1798)]

#write each part into a file
write.dna(rp_3, file = "rpL32_3.fasta", format = "fasta")
write.dna(rp_5, file = "rpL32_5.fasta", format = "fasta")

#place botg files in the same folder
dir.create("rpL32_3")
file.copy(from = "rpL32_3.fasta",
      to= "rpL32_3/rpL32_3.fasta")

dir.create("rpL32_5")
file.copy(from = "rpL32_5.fasta",
      to= "rpL32_5/rpL32_5.fasta")
rp5 <- readData("rpL32_5")
rp3 <- readData("rpL32_3")

# Set populations
rp5 <- set.populations(rp5, list(c(get.individuals(rp5)[[1]][1:18]), c(get.individuals(rp5)[[1]][19])))
rp5 <- set.outgroup(rp5, c("Dgua"))

rp3 <- set.populations(rp3, list(c(get.individuals(rp3)[[1]][1:18]), c(get.individuals(rp3)[[1]][19])))
rp3 <- set.outgroup(rp3, c("Dgua"))


calculate_sfs <- function(rp, pop_index = 1) {
  n <- length(rp@populations[[pop_index]])
  
  # Get the minor allele frequencies
  rp <- detail.stats(rp)
  ma <- as.data.frame(rp@region.stats@minor.allele.freqs[[1]])
  ma <- ma[1,]
  
  # Obtain the unfolded site frequency spectrum
  ma <- ma * n
  sfs <- sapply(1:n, function(i) sum(ma == i))
  
  monomorphic <- rp@n.valid.sites - sum(sfs)
  rp_sfs <- c(monomorphic, sfs)
  
  return(rp_sfs)
}

# Calculate SFS for the two parts
rp5_sfs <- calculate_sfs(rp5)
rp3_sfs <- calculate_sfs(rp3)
# Perform the HKA test
hka_rp <- hka_test(rp5_sfs, rp3_sfs)

3. Do the two halves of the rpL32 gene region evolve as expected under the neutral model?

Answer:

P-value of the HKA test: 0.9472453 
Observed values: 8 42 8.277778 45.72222 
Expected values: 7.825855 42.17415 8.451923 45.54808 
Residuals: 0.06225093 -0.02681567 -0.05990099 0.02580339 

4. Is there any chance that the region being analysed has been subject to positive selection? Maybe even the entire region?

Answer:

Positive selection is not likely to have occurred in the region under study, as indicated by the high p-value of 0.94. Based on available data, it appears that this area is changing neutrally.

3. The MK test

The MK test also compares polymorphism and divergence data but, in this case, between two types of sites within the same gene (coding region), synonymous (neutral class) and nonsynonymous sites (Figure 3). Under the null hypothesis, all nonsynonymous mutations are expected to be neutral and then the ratio of nonsynonymous to synonymous variation within species (Pn/Ps) is expected to equal the ratio of nonsynonymous to synonymous variation between species (Dn/Ds). However, these ratios will not be equal if part of the nonsynonymous variation is under either positive (i.e., mutations that increase individual fitness) or negative selection (i.e., mutations that are negatively selected but not highly deleterious).


Figure 3. The MK test.


We will use the same genomic region to see an example of how the MK test is applied to real data. The PopGenome package has a function to run this test in R. As noted above, the test is based on synonymous and non-synonymous polymorphisms and substitutions, so the first task is to extract the coding regions from the alignment of the whole genomic region (the original alignment includes non-coding regions):

    library(ape)
    
    ## Create new folder for the cds file in the working directory of this gene
    dir.create("rpL32/rpL32_cds")
    
    ## create a new alignment only with cds sequences
    aln<-read.dna(file="rpL32/rpL32.fasta", format="fasta")
    cds<-aln[,c(932:1024,1122:1430)]
    write.dna(cds, file="rpL32/rpL32_cds/rpL32_cds.fasta", format="fasta")
    
    ## read the cds sequences with PopGenome
    rpc<-readData("rpL32/rpL32_cds")
      
    ## summary of the data
    get.sum.data(rpc)
      
    ## set populations
    get.individuals(rpc)
    rpc <- set.populations(rpc,list(
            c(get.individuals(rpc)[[1]][1:18]), 
            c(get.individuals(rpc)[[1]][19]))
            )
    rpc<-set.outgroup(rpc,c("Dgua"))
    
    ## MK test - fisher test
    rpc<-MKT(rpc, do.fisher.test=TRUE)
    get.MKT(rpc)

Questions

5. What is the result of this test? Can we infer the action of positive selection from the divergence of rpL32 between these two species?

Based on this data, we cannot infer that positive selection has influenced the divergence of rpL32 between these two species. The results indicate that the rpL32 gene is evolving neutrally in this context.

6. How many amino acid changes have occurred in the RpL32 protein since the divergence of D. subobscura and D. guanche? And how many synonymous changes? Discuss these results in relation to the evolutionary rate estimated for this protein in the final assignment of practical 4).

Answer:

Following the divergence of both species, the RpL32 protein has 0 amino acid changes and just 1 synonymous change. This also indicates low evolutionary rate.

Final assignment

  • Run an HKA test comparing the E9 and E10 gene regions of D. melanogaster (using D. simulans for divergence calculations)
  • Run the MK test in the coding region of the OS gene of D. melanogaster (using D. simulans for divergence calculations).
    • The coordinates of the coding region in the alignment are: 2334:2405,2468:2542,2595:2868.
  • Run the MK test in the coding region of the Acph gene of D. subobscura (using D. guanche for divergence calculations).
    • The coordinates of the coding region in the alignment are: 292:475,604:702,769:1458,1522:1780,1838:1944.
# Create directories
dir.create("E9_folder")
dir.create("E10_folder")

# Move the fasta files to the new directories
file.copy("E9.fasta", "E9_folder/E9.fasta")
file.copy("E10.fasta", "E10_folder/E10.fasta")

# Read the fasta files into GENOME objects
e9 <- readData("E9_folder")
e10 <- readData("E10_folder")
# Summary of the data
get.sum.data(e9)
##          n.sites n.biallelic.sites n.gaps n.unknowns n.valid.sites
## E9.fasta    2043               128     18          0          2024
##          n.polyallelic.sites trans.transv.ratio
## E9.fasta                   1           1.031746
get.sum.data(e10)
##           n.sites n.biallelic.sites n.gaps n.unknowns n.valid.sites
## E10.fasta    2160               161    144          0          2011
##           n.polyallelic.sites trans.transv.ratio
## E10.fasta                   5           1.146667
# prepare to populations for e9
get.individuals(e9)
## [[1]]
##  [1] "MEL_AUS2"  "MEL_AUS4"  "MEL_AUS5"  "MEL_CAN"   "MEL_CYP"   "MEL_IRAK" 
##  [7] "MEL_ITA"   "MEL_KENIA" "MEL_NAMAN" "MEL_SPAIN" "MEL_USA1"  "MEL_USA2" 
## [13] "MEL_USA3"  "MEL_USSR1" "MEL_ZIM33" "SIM_ZIM70"
# prepare to set populations for e10
get.individuals(e10)
## [[1]]
##  [1] "mel-zimbab5"  "mel-zimbab12" "mel-zimbab13" "mel-zimbab16" "mel-zimbab21"
##  [6] "mel-zimbab27" "mel-zimbab35" "mel-zimbab36" "mel-zimbab50" "mel-zimbab58"
## [11] "mel-zimbab60" "mel-kenya11"  "mel-kenya12"  "mel-kenya37"  "mel-kenya38" 
## [16] "Dsim"
# Set populations for e9
e9 <- set.populations(e9, list(
  c(get.individuals(e9)[[1]][1:15]), 
  c(get.individuals(e9)[[1]][16])
))
e9 <- set.outgroup(e9, c("SIM_ZIM70"))

# Set populations for e10
e10 <- set.populations(e10, list(
  c(get.individuals(e10)[[1]][1:15]), 
  c(get.individuals(e10)[[1]][16])
))
e10 <- set.outgroup(e10, c("Dsim"))


# Function to calculate SFS for a specific population and region
calculate_sfs <- function(rp, pop_index, region_index) {
  # Extract the sample size
  n <- length(rp@populations[[pop_index]])
  
  # Get the minor allele frequencies
  rp <- detail.stats(rp)
  ma <- as.data.frame(rp@region.stats@minor.allele.freqs[[region_index]])
  ma <- ma[1,]
  
  # Obtain the unfolded site frequency spectrum
  ma <- ma * n
  sfs <- sapply(1:n, function(i) sum(ma == i))
  
  monomorphic <- rp@n.valid.sites - sum(sfs)
  rp_sfs <- c(monomorphic, sfs)
  
  return(rp_sfs)
}


# Calculate SFS for two different regions
e9_sfs <- calculate_sfs(e9, 1, 1) 
e10_sfs <- calculate_sfs(e10, 1, 1) 

# Run the HKA test
hka_e <- hka_test(e9_sfs, e10_sfs)
P-value of the HKA test: 0.001289523 
Observed values: 9 78 121.3333 103.5333 
Expected values: 36.35849 50.64151 93.97485 130.8918 
Residuals: -4.537213 3.84449 2.82219 -2.39131 

Based on the results of these test, answer the following questions:

7. Are the E9 or E10 gene regions under positive selection in D. melanogaster?

Answer:

The observed E9 and E10 regions are not evolving neutrally (p-value of 0.001289523 in HKA test). It is unclear if there is positive or negative selection, as its indistinguishable in the HKA test results.

8. If I say that the region referred to E10 is evolving under neutrality (it is a noncoding region with no functional element), does this modify your answer to question 7?

Answer:

If E10 is neutral, then the deviation in E9 indicates that its positive selection. This because E9 shows greater divergence/polymorhism than expected by reference to E10

9. Is there evidence of selection in two genes analysed by the MK test? What type of selection (positive or negative)?.

Answer:

get.individuals(e9)
## [[1]]
##  [1] "MEL_AUS2"  "MEL_AUS4"  "MEL_AUS5"  "MEL_CAN"   "MEL_CYP"   "MEL_IRAK" 
##  [7] "MEL_ITA"   "MEL_KENIA" "MEL_NAMAN" "MEL_SPAIN" "MEL_USA1"  "MEL_USA2" 
## [13] "MEL_USA3"  "MEL_USSR1" "MEL_ZIM33" "SIM_ZIM70"
get.individuals(e10)
## [[1]]
##  [1] "mel-zimbab5"  "mel-zimbab12" "mel-zimbab13" "mel-zimbab16" "mel-zimbab21"
##  [6] "mel-zimbab27" "mel-zimbab35" "mel-zimbab36" "mel-zimbab50" "mel-zimbab58"
## [11] "mel-zimbab60" "mel-kenya11"  "mel-kenya12"  "mel-kenya37"  "mel-kenya38" 
## [16] "Dsim"
e9 <- set.populations(e9, list(
  c(get.individuals(e9)[[1]][1:15]), 
  c(get.individuals(e9)[[1]][16])
))
e9 <- set.outgroup(e9, c("SIM_ZIM70"))

e10 <- set.populations(e10, list(
  c(get.individuals(e10)[[1]][1:15]), 
  c(get.individuals(e10)[[1]][16])
))
e10 <- set.outgroup(e10, c("Dsim"))

e9 <- MKT(e9)
mk_e9 <- e9@MKT

e10 <- MKT(e10)
mk_e10 <- e10@MKT

results_e9 <- mk_e9[[1]]
results_e10 <- mk_e10[[1]]
E9 gene results:  8
E10 gene results:  42

MK for E9 is 8, meaning netral evolution, while the E10 MK result is 42, showing positive selection, that shows a fixation of advantageous amino acid changes between species.

Deliver info

Deliver this document in AULAESCI with your answers

Deadline: June 28, 2024 - 23:59

Doubts?